Mastering Web Scraping for Data Collection

MethodsNET Workshop

Aurélien Goutsmedt, Thomas Laloux and Marine Bardou (UCLouvain)

2024-11-01

1 Training Goals

Motivations

First Session:

  • Giving a basic understanding of what web scraping is and what it can do
  • Discussing ethical (and legal) issues linked to web scraping
  • Proposing a roadmap for learning how to practise web scraping using R
  • Providing bits of code and practical tips


Second Session:

  • Hands-on practice with different exercises by level of difficulty

Download the documents

https://github.com/agoutsmedt/methodsnet_scraping/

Prerequisites

  • You will need R and RStudio for the second session (please make sure to install them!)
  • These slides are built from a .qmd (Quarto) document \(\Rightarrow\) all the code used in these slides can be run in RStudio
# These lines of code have to be run first if you want to install all the packages directly

# pacman will be used to install (if necessary) and load packages
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman") # install if missing
library(pacman)

# Installing the needed packages in advance
p_load(tidyverse, # basic suite of packages
       glue, # useful for building strings (notably URLs)
       scico, # color palettes
       patchwork, # for juxtaposition of graphs
       DT) # to display html tables

2 What is Web Scraping?

What is web scraping?

  • Web scraping is a method for extracting data available on the World Wide Web
  • The World Wide Web, or “Web”, is a network of websites (online documents coded in HTML and CSS)
  • A web scraper is a program, for instance in R, that automatically reads the HTML structure of a website and extracts the relevant content (text, hypertext references, tables)
    • No need to fully understand HTML and CSS
  • Useful when there are many pages to scrape

What is HTML and CSS?
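As a minimal, made-up illustration: an HTML page is a tree of tags that define the document's structure, while class and id attributes are the hooks that CSS (and web scrapers) use to target specific elements.

```html
<html>
  <head>
    <title>A speech</title>
    <style>
      /* CSS: style every element carrying the class "speech" */
      .speech { font-size: 16px; }
    </style>
  </head>
  <body>
    <h1 id="title">Dot plots for the Eurosystem?</h1>
    <p class="speech">Ladies and gentlemen,</p>
  </body>
</html>
```

A scraper uses the same hooks (here, the class `speech` or the id `title`) to select the content it needs.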

API vs. web scraping

  • API (Application Programming Interface) provides a structured and predictable way to retrieve data from a service. It’s like ordering from a menu; you request specific data and receive it in a structured format
  • Web Scraping is the process of programmatically extracting data from the web page’s HTML itself. It’s akin to manually copying information from a book; you decide what information you need and how to extract it

API vs. web scraping

  • Control and Structure: APIs offer structured access to data, whereas web scraping requires parsing HTML and often cleaning the data yourself.
  • Ease of Use: Using an API can be simpler since it is designed for data access (though this is not always the case). Scraping requires dealing with HTML changes and is more prone to breaking.
  • Availability: Not all websites offer an API, making web scraping a necessity in some cases.
  • Limitations and Authorization: APIs often have rate limits and may require authentication, but they grant authorized access to the data. Web scraping can bypass these limits but might violate terms of service.

Forget about big data, small data is everywhere!

  • A wide variety of data you can collect:
    • official documents/speeches
    • agendas and meetings
    • lists of personnel or experts in commissions
    • laws or negotiations
  • To take the evolution of pages over time into account, you can use the Internet Archive
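Internet Archive (Wayback Machine) snapshots follow a predictable URL pattern, so they can be built programmatically (a sketch: the timestamp shown is a hypothetical snapshot date):

```r
# Wayback Machine URLs follow the pattern:
# https://web.archive.org/web/<YYYYMMDDhhmmss>/<original url>
timestamp <- "20230101000000" # hypothetical snapshot date
target    <- "https://www.bis.org/cbspeeches/"
archive_url <- paste0("https://web.archive.org/web/", timestamp, "/", target)
archive_url
```

Requesting a snapshot that does not exist redirects to the closest available one, so an approximate timestamp is usually enough.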

Building databases

Involves a series of questions:

  • What’s your research question and which data would be appropriate to answer it?
  • How much data to collect?
    • Trade-off between collecting a lot of information (which requires more time) and risking missing some information at a later step
  • How to scrape the data? In which format?
    • Trade-off between extracting the data properly in a first step and cleaning it in a second step
  • What do you lose by doing it automatically rather than manually? (or the reverse)
  • How to analyse/understand my new data?
  • How to update my database?

3 The Ethics of Web Scraping

Ethical considerations

  • Legal Considerations: Not all data is free to scrape. Websites’ terms of service may explicitly forbid web scraping, and in some jurisdictions, scraping can have legal implications
    • What is “forbidden” by a website is not necessarily “illegal”
  • Privacy Concerns: Scraping personal data can raise significant privacy issues and may be subject to regulations like GDPR in Europe
  • Website Performance: Scraping, especially if aggressive (e.g., making too many requests in a short period), can negatively impact the performance of a website, affecting its usability for others

Questions at stake (Krotov, Johnson, and Silva 2020)


Ethical practices

  • Respect robots.txt: This file on websites indicates which parts should not be scraped
  • Rate Limiting: Making requests at a reasonable rate to avoid overloading the website’s server
  • User-Agent String: Identifying your scraper can help website owners understand the nature of the traffic
  • Data Use: Consider the ethical implications of how scraped data is used. Ensure it respects the privacy and rights of individuals

4 How to scrape a website?

What is R?

  • Programming language created in the 1990s for statistical computing
  • Free and open source
  • Enriched by a large set of packages
  • Generally used in the development environment RStudio

R basic concepts

  • object: like a box to which you assign a name and put various things in
  • data frame (it is a specific type of object!): comparable to a spreadsheet
firstnames <- c("Anna", "Laura", "Lise")
firstnames
[1] "Anna"  "Laura" "Lise" 
ages <- c(26,28,25)
ages
[1] 26 28 25
discipline <- c("maths", "sociology", "law")
discipline
[1] "maths"     "sociology" "law"      
my_data <- data.frame(firstnames, ages, discipline)
my_data
  firstnames ages discipline
1       Anna   26      maths
2      Laura   28  sociology
3       Lise   25        law

R basic concepts

  • function: a block of code to which you provide input and which returns an output. You can define a function yourself, but many are pre-built.
length(firstnames)
[1] 3
  • package: most importantly, packages contain functions!
library(stringr)
str_detect(firstnames, "Anna")
[1]  TRUE FALSE FALSE

Useful packages for webscraping in R

  • rvest: navigating websites, scraping and parsing HTML code
  • polite: responsible web etiquette (informing the website that you are scraping)
  • RSelenium: using a bot to interact with a website
p_load(rvest, # scraping and manipulating html pages
       polite, # scraping ethically
       RSelenium) # scraping by controlling a browser with a bot

The Role of Sitemaps

  • Sitemap: a file informing search engines about the URLs on a website that are available for crawling
    • It helps to understand the structure of a website
    • It helps to find where the information we want to extract is located

Being respectful of the website

Declaring yourself:

session <- polite::bow(bis_website_path, 
                       user_agent = "polite R package - used for academic training by 
                       Aurélien Goutsmedt (aurelien.goutsmedt[at]uclouvain.be)")
cat(session$robotstxt$text)
#Format is:
#       User-agent: <name of spider>
#       Disallow: <nothing> | <path>
#-------------------------------------------

User-Agent: *
Disallow: /dcms
Disallow: /metrics/
Disallow: /search/
Disallow: /staff.htm
Disallow: /embargo/
Disallow: /app/
Disallow: /goto.htm
Disallow: /login
#Disallow: /cbhub
Disallow: /cbhub/goto.htm
Disallow: /doclist/
# Committee comment letters
Disallow: /publ/bcbs*/
Disallow: /bcbs/ca/
Disallow: /bcbs/commentletters/
Disallow: /*/publ/comments/
# Hide the Basel Framework standards, only chapters should be indexed.
Disallow: /basel_framework/standard/

Sitemap: https://www.bis.org/sitemap.xml
session$robotstxt$sitemap
    field useragent                           value
1 Sitemap         * https://www.bis.org/sitemap.xml

Using sitemap

Code
# This function goes to a sitemap page and extracts all the URLs found there
extract_url_from_sitemap <- function(url, delay = 1) { 
  urls <- read_html(url) %>% 
    html_elements(xpath = ".//loc") %>% 
    html_text()
  Sys.sleep(delay) # You set a delay to avoid overloading the website
  return(urls)
}

# insistently() retries the function when the page fails to load
insistently_extract_url <- insistently(extract_url_from_sitemap, 
                                       rate = rate_backoff(max_times = 5)) 

document_pages <- extract_url_from_sitemap(session$robotstxt$sitemap$value) %>% 
  .[str_detect(., "documents")] # We keep only the URLs for documents

bis_pages <- map(document_pages[1:5], # showing the code just on the first five years
                 ~insistently_extract_url(url = ., 
                                          delay = session$delay))

bis_pages <- tibble(year = str_extract(document_pages[1:5], "\\d{4}"),
                    urls = bis_pages) %>% 
  unnest(urls)

The key steps of web scraping

There are many ways to scrape, but your baseline scenario is:

  1. Read the HTML code of the page (you can access the page with a URL directly, or after interacting with it using RSelenium).

  2. Extract some specific elements of that page using selectors.

  3. Read the text (or other information) contained in these elements.

  4. Store the retrieved information in a data frame.
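The four steps above can be sketched with rvest. The example below parses an inline snippet (made up for illustration) so it is self-contained; on a live page, step 1 would be `read_html(url)` instead:

```r
library(rvest)
library(tibble)

# 1. Read the HTML code of the page (here, an inline stand-in for a real page)
page <- minimal_html('<h2 class="title">First speech</h2>
                      <h2 class="title">Second speech</h2>')

titles <- html_elements(page, ".title") # 2. extract elements with a CSS selector
titles_text <- html_text(titles)        # 3. read the text contained in them
data <- tibble(title = titles_text)     # 4. store the information in a data frame
data
```

The same four calls, with the right URL and selector, are the backbone of the BIS examples that follow.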

Scraping a BIS speech with rvest

Scraping one page: using a scrape helper

  • Browser scraping add-ons help you navigate through the elements of a webpage
    • An XPath is the path towards a specific part of a webpage
    • CSS selectors are primarily meant for styling web pages, but they also allow you to locate an element within the HTML structure
  • Typical scraping helpers: ScrapeMate and SelectorGadget
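The same element can be targeted either by a CSS selector or by an XPath; rvest accepts both (a sketch on a made-up inline snippet):

```r
library(rvest)

# A made-up snippet: the same element targeted two ways
page <- minimal_html('<div id="content"><p class="speech-text">Ladies and gentlemen,</p></div>')

html_text(html_element(page, "p.speech-text"))                     # CSS selector
html_text(html_element(page, xpath = "//p[@class='speech-text']")) # equivalent XPath
```

Scraping helpers like SelectorGadget propose both forms; pick whichever is more readable for your page.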

Scraping a BIS speech with rvest

url_speech <- "https://www.bis.org/review/r241022f.htm"
page <- read_html(url_speech)
print(page)
{html_document}
<html class="no-js" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
[1] <head>\n<meta content="IE=edge" http-equiv="X-UA-Compatible">\n<meta cont ...
[2] <body>\n<div class="dt tagwidth" id="body">\n<div id="bispage">\n<noscrip ...
page %>% 
  html_element("h1") %>%
  html_text
[1] "Joachim Nagel: Dot plots for the Eurosystem?"
page %>% 
  html_element("#extratitle-div p:nth-child(1)") %>% 
  html_text
[1] "Speech by Dr Joachim Nagel, President of the Deutsche Bundesbank, at Harvard University, Cambridge, 22 October 2024."
page %>% 
  html_elements(".Reden") %>% 
  html_text
[1] "Ladies and gentlemen,"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
[2] "it is a great pleasure to be at Harvard again, to meet long time companions like Hans-Helmut Kotz and to exchange ideas with top scientists such as Benjamin Friedman. When I was in this round two years ago, we were dealing with an unprecedented global inflation spike. Fortunately, the worst is behind us, and inflation in the euro area is heading back to the Eurosystem's target. We have not brought the inflation ship safely back into the 2% harbour, but the port is in sight. Thus, I can focus on another question today."                                                                                                                                                                
[3] "Before I do that, let me share an analogy to set the stage for my discussion. Back in the 1970s and 1980s, the field of economics was split into two seemingly incompatible schools of thought: New Keynesian and New Classical. Their proponents were not too polite in their language, calling assumptions \"foolishly restrictive\" or comparing an opponent to someone attempting to pass himself off as Napoleon Bonaparte. But, over time, ideas from both camps ultimately merged to form a consensus called the New Neoclassical Synthesis, the very foundation of modern macroeconomics. Gregory Mankiw neatly described this story in his essay \"The Macroeconomist as Scientist and Engineer\"."
[4] "The takeaway from this analogy is that complex issues are rarely black or white. With this in mind, I want to explore whether the conduct of monetary policy in the euro area could be enhanced by offering more detailed and nuanced information regarding its future outlook. More specifically, today I will address the following question: Should the Eurosystem introduce dot plots?"                                                                                                                                                                                                                                                                                                                 

Scraping BIS: understanding URLs

day <- "01"
month <- "10"
year <- 2024 # we want to look at all the speeches since October 1st 2024
page <- 2
# The package glue allows you to insert variables into a text string
url_second_page <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={page}&cbspeeches_page_length=25")
print(url_second_page)
https://www.bis.org/cbspeeches/index.htm?fromDate=01%2F10%2F2024&cbspeeches_page=2&cbspeeches_page_length=25
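Since only `{page}` changes between query pages, glue can generate all the URLs at once, because it vectorizes over its inputs (a sketch assuming, for illustration, three result pages):

```r
library(glue)

day <- "01"; month <- "10"; year <- 2024
pages <- 1:3 # suppose there are 3 result pages
urls <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={pages}&cbspeeches_page_length=25")
urls[2] # same URL as url_second_page above
```

In practice, you first scrape the total number of pages (as done below) and then substitute it for the hard-coded `3`.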

Scraping the query page: mixing rvest and RSelenium

# Launch Selenium to go on the website of bis
driver <- rsDriver(browser = "firefox", # can also be "chrome"
                   chromever = NULL,
                   port = 4444L) 
remote_driver <- driver[["client"]]

Scraping one page: mixing rvest and RSelenium

remote_driver$navigate(url_second_page)
Sys.sleep(session$delay)


element <- remote_driver$findElement("css selector", ".item_date")
element$getElementText()[[1]]
[1] "24 Oct 2024"


elements <- remote_driver$findElements("css selector", ".item_date")
length(elements)
[1] 25
elements[[25]]$getElementText()[[1]]
[1] "15 Oct 2024"

Scraping one page

Code
data_page <- tibble(date = remote_driver$findElements("css selector", ".item_date") %>% 
                      map_chr(., ~.$getElementText()[[1]]),
                    info = remote_driver$findElements("css selector", ".item_date+ td") %>% 
                      map_chr(., ~.$getElementText()[[1]]),
                    url = remote_driver$findElements("css selector", ".dark") %>% 
                      map_chr(., ~.$getElementAttribute("href")[[1]])) %>% 
  separate(info, c("title", "description", "speaker"), "\n")

Scraping all the pages

starting_url <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page=1&cbspeeches_page_length=25")
remote_driver$navigate(starting_url)

# Extract the total number of pages
nb_pages <- remote_driver$findElement("css selector", ".pageof")$getElementText()[[1]] %>%
  str_remove_all("Page 1 of ") %>%
  as.integer()

# creating a list object to progressively store the information
metadata <- vector(mode = "list", length = nb_pages)

for(page in 1:nb_pages){
  url <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={page}&cbspeeches_page_length=25")
  remote_driver$navigate(url)
  nod <- nod(session, url) # introducing politely to the new page
  Sys.sleep(session$delay) # using the delay time set by polite

  metadata[[page]] <- tibble(date = remote_driver$findElements("css selector", ".item_date") %>% 
                            map_chr(., ~.$getElementText()[[1]]),
                          info = remote_driver$findElements("css selector", ".item_date+ td") %>% 
                            map_chr(., ~.$getElementText()[[1]]),
                          url = remote_driver$findElements("css selector", ".dark") %>% 
                            map_chr(., ~.$getElementAttribute("href")[[1]])) 
}

metadata <- bind_rows(metadata) %>% 
  separate(info, c("title", "description", "speaker"), "\n")
driver$server$stop() # we close the bot once we've finished

5 Exercises

Exercises

Easy exercise

  • You want to scrape and analyze results of the 2024 UK General Election from the BBC website.
  • Objectives:
    • Count the number of parties with at least one seat.
    • Determine which parties won or lost compared with the previous election.
    • Calculate and visualize the average votes per seat for each party with at least one seat.
    • Compare findings for the entire UK vs. England.

Easy exercise

Approach

  1. Use the rvest and polite packages to retrieve data from the BBC website for party names, seats, votes, and seat changes for all parties in the UK.

  2. Organize the data into a data frame and clean it. Convert seat and vote counts to numeric and remove extraneous symbols.

  3. Analysis:

  • Count the number of parties with at least one seat.
  • Order the parties according to seat gains/losses.
  • Calculate votes per seat for each party with at least one seat.
  • Plot the number of votes per seat for all parties with at least one seat.
  4. Repeat the process for parties in England and then compare the results between the UK and England.
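As a starting point for steps 1-2 (a sketch: the table below is a made-up stand-in, and you will need to inspect the BBC page yourself to find the real selector), results tables can often be pulled with `html_table()` and cleaned with `readr::parse_number()`:

```r
library(rvest)
library(dplyr)

# Inline stand-in for a results table; on the real page you would start
# from read_html(url) instead of minimal_html()
page <- minimal_html('<table>
  <tr><th>Party</th><th>Seats</th><th>Votes</th></tr>
  <tr><td>Party A</td><td>200</td><td>9,000,000</td></tr>
  <tr><td>Party B</td><td>100</td><td>6,000,000</td></tr>
</table>')

results <- page %>%
  html_element("table") %>%
  html_table() %>% # converts the <table> element into a tibble
  mutate(across(c(Seats, Votes), ~ readr::parse_number(as.character(.))), # strip commas, make numeric
         votes_per_seat = Votes / Seats)
```

`parse_number()` handles symbols like commas or "+" that often decorate election figures.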

Medium exercise part 1

You want to know what happened to the files which were EU legislative priorities in 2023-2024

  • 1. Scrape the basic information

We are going to list all relevant procedures. In the EU, once proposed, each piece of legislation gets a procedure number; for some procedures, this reference includes ‘COD’.

Go to the page that lists the legislative files which were priorities for 2023-24

You have to scrape this page to obtain a data frame, in which there will be:

  • the title
  • number
  • url towards the specific page of each procedure

Check one or two links

  • can you copy paste them in a browser and access the page?
  • Is there anything missing in the URL? How could you fix this?

Search manually a procedure to find out how urls are made: https://oeil.secure.europarl.europa.eu/oeil/search/search.do?searchTab=y

Tip: you can use the paste() function.
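For instance (a sketch: the relative path shown is illustrative), if the scraped hrefs are relative, you can prepend the domain with paste0():

```r
# A scraped href is often relative to the website's root, e.g.:
relative_url <- "/oeil/popups/ficheprocedure.do?reference=2021/0433(CNS)&l=en"
full_url <- paste0("https://oeil.secure.europarl.europa.eu", relative_url)
full_url
```

paste0() is vectorized, so the same line works on a whole column of relative URLs at once.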

Medium exercise part 2

  • 2. Filter only the procedures of interest

Now you have listed the names of all relevant procedures and the links to access them. Create a data frame that contains only the procedures with ‘COD’ in their reference number.

Tip: you can use the str_detect() function of stringr.
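A sketch of this filtering step, on a made-up data frame (yours will come from the scraped page, with its own column names):

```r
library(dplyr)
library(stringr)

# Hypothetical data frame of procedures
procedures <- tibble::tibble(
  title     = c("A directive", "A regulation"),
  reference = c("2021/0433(CNS)", "2023/0001(COD)") # made-up references
)

cod_procedures <- procedures %>%
  filter(str_detect(reference, "COD")) # keep rows whose reference contains "COD"
```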

Medium exercise part 3

  • 3. Scrape a single page

Take this single URL: https://oeil.secure.europarl.europa.eu/oeil/popups/ficheprocedure.do?reference=2021/0433(CNS)&l=en. It is one of the ones you have listed.

In a separate data frame (which will have only one line, and three columns), scrape:

  • the status of the procedure (i.e. at which stage it is)

  • the date at which the legislative file was published

  • the date at which the EP took its decision

Tip: for the dates, first select all the dates. Then, select the names of all the events they correspond to. Finally, select your event of interest with grepl() (use for example “proposal”).
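A sketch of the suggested approach, with made-up event names and dates standing in for what you would scrape from the page:

```r
# Hypothetical parallel vectors of event names and dates scraped from a procedure page
events <- c("Legislative proposal published",
            "Committee referral announced",
            "Decision by Parliament")
dates  <- c("2021-07-20", "2021-09-13", "2023-05-09")

# grepl() returns TRUE where the pattern matches: use it to pick the matching date
proposal_date <- dates[grepl("proposal", events)]
proposal_date
```

This works because the two vectors are aligned: the i-th date belongs to the i-th event.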

Medium exercise part 4

4. Writing a function

Write a function that automates the scraping you did in question 3 (generalize your code!). For each URL, the function has to scrape the same three pieces of information. Run that function and store the results in a data frame that also contains the numbers of the procedures and their URLs.

Tips:

  • Explanations about creating a function in R here: https://www.r-bloggers.com/2022/04/how-to-create-your-own-functions-in-r/

  • In the function, you can use the tibble() function to bind the different information together.

  • At the end, use return(). This indicates to R that it is the output of the function.

  • When applying the function, you may find the lapply() function useful!

  • Don’t hesitate to test the function on one or a handful of URLs!

Some of the info you are looking for may not be on all pages. Use the function length() to check whether your code found something, and write “To check” if the information is not found. Why is some info missing on some pages?
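A hedged skeleton of such a function (the CSS selectors below are placeholders, not the real ones: you must replace them after inspecting a procedure page with a scrape helper):

```r
library(rvest)
library(tibble)

scrape_procedure <- function(url) {
  page <- read_html(url)
  status <- page %>% html_element(".procedure-status") %>% html_text() # placeholder selector
  events <- page %>% html_elements(".event-name") %>% html_text()      # placeholder selector
  dates  <- page %>% html_elements(".event-date") %>% html_text()      # placeholder selector

  proposal_date <- dates[grepl("proposal", events)]
  ep_date       <- dates[grepl("Decision", events)]

  # Some events are missing on some pages (e.g. no EP decision yet):
  # length() tells you whether anything was found
  if (length(proposal_date) == 0) proposal_date <- "To check"
  if (length(ep_date) == 0) ep_date <- "To check"

  return(tibble(url = url, status = status,
                proposal_date = proposal_date, ep_date = ep_date))
}

# results <- lapply(cod_urls, scrape_procedure) %>% dplyr::bind_rows()
```

Each call returns a one-row tibble, so lapply() plus bind_rows() gives you the full data frame.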

Medium exercise part 5

5. Explore the data

  • Duration: Calculate the number of days between the legislative proposal and the EP decision in a new column of your data frame.

Tip: you have to tell R that you are working with dates. Search for the function that allows you to do this!
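A base-R sketch of the date arithmetic (the dates themselves are illustrative):

```r
proposal_date <- as.Date("2021-07-20") # as.Date() tells R these strings are dates
ep_decision   <- as.Date("2023-05-09")
duration <- as.numeric(ep_decision - proposal_date) # difference in days
duration
```

In your data frame, the same logic goes inside a mutate() applied to the two date columns.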

  • Missing data: What happens in the cases where the date of the EP decision is not yet available? Pay attention when calculating the duration!

  • Spotting specific cases: Let's look for the longest process. When did that procedure start?

Difficult exercise

Scraping the ECB occasional papers

  • Goal: Create a database with the titles, authors, abstracts, JEL-codes, URLs, and date of publication of all the ECB occasional papers. Find the most recurrent words and expressions in the abstracts and titles.
  • Approach: first look at the page and understand the structure of the page.
    • Is the page fully loaded when you access it? If not, you will need to scroll down to access all the papers.
    • Is the abstract or the JEL-codes visible? If not, you will need to click on a button.
    • Considering this, you may need to use RSelenium to scrape the page.
  • Be careful to handle the particular structure of some information: for instance, you want to extract each author individually for each paper.
  • Tip: check that you always have the right number of items for each piece of information

6 Resources

Useful resources

References

Krotov, Vlad, Leigh Johnson, and Leiser Silva. 2020. “Tutorial: Legality and Ethics of Web Scraping.”